Using programming to get stories

Louis Goddard

Who am I?

Work

  • PhD in English from 2013, freelancing alongside

  • Data journalist at The Times and Sunday Times since 2016

  • Data advisor at Global Witness from next month

PhD work

Electric cars / Sugar daddies / Child abuse / Passwords / Trains / Church land / Oxbridge places / Ski doping / Consultations / Corbyn / Antidepressants / Gagging clauses 1 / US campaign finance / Gagging clauses 2 / School funding / Tax 1 / Tax 2

Interests

  • Cyber-crime and the dark net

  • Transparency and open data

  • Bringing innovative techniques to data journalism

Tools

  • R programming language

  • Tidyverse family of packages

  • Elasticsearch and Kibana

How can programming help us get stories out of data?

The three As

  • Access

  • Amalgamation (OK, this one’s not great)

  • Analysis

Access

  • People talk a lot about creating new data

  • Another way to think of it: accessing hidden data

  • Examples: web scraping, using APIs, getting data from PDFs and Word documents, working with data bigger than Excel can handle

Amalgamation

  • 2 + 2 = 5

  • When combined, data is more than the sum of its parts

  • Examples: joins and fuzzy matching, working with weird file formats, tidying data

Analysis

  • How can programming help us see stories?

  • Visualisation is important, but not the be-all and end-all

  • Examples: geospatial analysis, time series analysis, statistical analysis, search

Access

Web scraping

  • Getting information from a website into structured form

  • 90% of scraping jobs for stories follow this format:

    1. Page through an index to get links to individual items
    2. Load each item and extract some data
    3. Combine the results
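
The three steps above can be sketched with rvest. This is a minimal illustration, not the code behind any story mentioned here: the HTML snippets, the `a.item` and `.closes` selectors and the `/item/...` paths are all made up so that it runs offline. In real use you would pass URLs to `read_html()` and find the selectors with SelectorGadget.

```r
library(rvest)

# Hypothetical index and item pages, held as HTML strings so the sketch
# runs without a network connection
index_pages <- c(
  '<a class="item" href="/item/1">One</a><a class="item" href="/item/2">Two</a>',
  '<a class="item" href="/item/3">Three</a>'
)
item_pages <- list(
  "/item/1" = '<h1>First consultation</h1><span class="closes">2019-01-31</span>',
  "/item/2" = '<h1>Second consultation</h1><span class="closes">2019-02-28</span>',
  "/item/3" = '<h1>Third consultation</h1><span class="closes">2019-03-31</span>'
)

# 1. Page through the index to get links to individual items
links <- unlist(lapply(index_pages, function(html) {
  read_html(html) |> html_elements("a.item") |> html_attr("href")
}))

# 2. Load each item and extract some data
scrape_item <- function(link) {
  page <- read_html(item_pages[[link]])  # real code: read_html(<full URL>)
  data.frame(
    link = link,
    title = page |> html_element("h1") |> html_text2(),
    closes = page |> html_element(".closes") |> html_text2()
  )
}

# 3. Combine the results into one table
results <- do.call(rbind, lapply(links, scrape_item))
```

On a real site it is polite to pause between requests, for example with `Sys.sleep(1)` inside `scrape_item()`.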

Consultations

Consultations screen 1

Consultations screen 2

Consultations screen 3

Consultations screen 4

Web scraping

  • Key tools: rvest and SelectorGadget
##   [1] "Alice Thomson"                                                                     
##   [2] "Isa essentials"                                                                    
##   [3] "Anna Murphy"                                                                       
##   [4] "Greatest matches"                                                                  
##   [5] "Delay Brexit till June, May tells Brussels"                                        
##   [6] "Risk of no-deal has risen sharply"                                                 
##   [7] "The letter in full"                                                                
##   [8] "Why the EU holds all the cards"                                                    
##   [9] "Pilots scoured 737 manual as doomed jet plunged"                                   
##  [10] "Maitlis is Newsnight’s lead presenter in all‑female line‑up"                       
##  [11] "Waterfall death tourists were not told of dangers"                                 
##  [12] "It bloody well will change me, says £71m lottery winner"                           
##  [13] "UK has more to smile about despite Brexit split"                                   
##  [14] "Reward at last for the girl who saved a village from the Nazis"                    
##  [15] "Cataract surgery doesn’t work, says NHS in cost-cutting drive"                     
##  [16] "Girl, 9, survives cancer after proton beam therapy"                                
##  [17] "Family pulls £1m gallery donation over opioid scandal"                             
##  [18] "Sackler grant withdrawal may be a watershed in arts funding"                       
##  [19] "Splash of milk in your tea can reduce throat cancer risk"                          
##  [20] "Economic boost as wages rise and employment hits new high"                         
##  [21] "Journalist faces police for ‘misgendering’ trans woman"                            
##  [22] "Why we like our homes to be as warm as Africa"                                     
##  [23] "Mess up Brexit and none of you will be PM, May warns cabinet"                      
##  [24] "1604, the year it all kicked off"                                                  
##  [25] "‘Cordial’ chat fails to bring Johnson back into the fold"                          
##  [26] "Canada deal close but just days left for dozens more"                              
##  [27] "Grassroots Brexiteers seize chance to oust MPs"                                    
##  [28] "Westminster ‘dithering’ condemned by business"                                     
##  [29] "Watchdogs fine Leave campaign groups"                                              
##  [30] "Corbyn machine chews and spits out defectors"                                      
##  [31] "Health secretary is warned over perils of genetic screening"                       
##  [32] "Chaos in court at Soubry abuse case"                                               
##  [33] "I can’t forgive lying MP, says vicar"                                              
##  [34] "Is it really worse not to be talked about?"                                        
##  [35] "You’re a misogynist bully, says minister"                                          
##  [36] "That sound is leadership klaxon at cabinet"                                        
##  [37] "Ten minutes’ exercise each week cuts risk of early death by a fifth"               
##  [38] "Officer covered up her links to juror"                                             
##  [39] "Boy, 17, stabbed in neck near Jodie murder spot"                                   
##  [40] "Attaché attire wins friends"                                                       
##  [41] "Royal Navy charts new course east of Suez"                                         
##  [42] "Skunk blamed for psychosis epidemic"                                               
##  [43] "‘It was just a blur for 15 years. I had demons but I was suppressing it’"          
##  [44] "County lines drug gangs behind doubling in number of child slaves"                 
##  [45] "Gamekeepers blamed for illegal hen harrier deaths"                                 
##  [46] "Jammy Aussies knock Britain for six in marmalade contest"                          
##  [47] "Fly to New York via space in 29 minutes"                                           
##  [48] "Warning labels for lavatories to stave off water crisis"                           
##  [49] "Queen film makes Oscar-winner cringe"                                              
##  [50] "Students ‘locked out’ during Queen’s visit"                                        
##  [51] "Hotel pair held over deaths of teenagers in party crush"                           
##  [52] "Female maths genius turns the tables on chauvinists to win £500,000 prize"         
##  [53] "Buddhist with knife lashed out at partner over taunts"                             
##  [54] "Chinese investors ready to snap up struggling schools"                             
##  [55] "HS2 ‘will deepen regional divide as north loses out’"                              
##  [56] "Lineker calls foul over scam adverts on Twitter"                                   
##  [57] "From adverts to racial bias, we will take on the dangers of data"                  
##  [58] "Climate change is not taken seriously under Grayling"                              
##  [59] "Remainers get nowhere by acting like pop fandom"                                   
##  [60] "My life may have been saved this week"                                             
##  [61] "Sex education must be fit for a childhood online"                                  
##  [62] "News in pictures"                                                                  
##  [63] "Getting rid of Theresa May solves nothing"                                         
##  [64] "China wants to divide and rule in Europe"                                          
##  [65] "A teenage conviction can’t be a scar for life"                                     
##  [66] "New life born into a howling blizzard"                                             
##  [67] "Drinks producers should pay more to get rid of litter"                             
##  [68] "Cult of Victimhood"                                                                
##  [69] "Wet Rot"                                                                           
##  [70] "Liar and Fantasist"                                                                
##  [71] "Bercow’s ruling and parliamentary tradition"                                       
##  [72] "Nature notes"                                                                      
##  [73] "Birthdays today"                                                                   
##  [74] "Shooter was ‘on his way to third target’"                                          
##  [75] "Weapon used in attack sells out"                                                   
##  [76] "I still hear the screams, says mosque attack survivor"                             
##  [77] "Erdogan to West: we will send you back in coffins"                                 
##  [78] "Brazil could be the latest Nato ally, Trump suggests"                              
##  [79] "Le Monde staff fight takeover by Czech coal and gas billionaire"                   
##  [80] "Pope refuses resignation of cardinal who hid sex abuse"                            
##  [81] "Broadcaster in sexism row after going off air"                                     
##  [82] "Tram killer ‘should have been in jail’"                                            
##  [83] "‘Anti-vax’ advocate catches chickenpox"                                            
##  [84] "Russia seals £1.5bn military jet deal with Egypt"                                  
##  [85] "Merkel set to miss 1.5% defence spending target"                                   
##  [86] "Kazakh strongman hands over after 28 years in charge"                              
##  [87] "Isis fighters pinned on river bank after final defences are breached"              
##  [88] "Hundreds are feared dead in cyclone floods"                                        
##  [89] "Power cuts are a fact of life, South Africans told"                                
##  [90] "Trophy hunter who killed sleeping lion is marked man"                              
##  [91] "Trudeau’s top official quits ‘in disgrace’"                                        
##  [92] "Brother of Bezos’s lover accused of leak"                                          
##  [93] "Rising inflation threatens to hit living standards"                                
##  [94] "Metro Bank quizzed over £120m grant"                                               
##  [95] "EU fines Google €1.5bn for breach of competition rules"                            
##  [96] "Kingfisher boss to leave before end of DIY turnaround"                             
##  [97] "Lloyds boss loses final-salary pension"                                            
##  [98] "Office Outlet collapse risks 1,200 jobs"                                           
##  [99] "Even a skip-counter like me agrees economic indicators need updating"              
## [100] "Who’s next to take the pension hit?"                                               
## [101] "Inmarsat shares take off on hopes of bidding war"                                  
## [102] "Kier plunges to £35m losses after bins saga"                                       
## [103] "Tenants suffer as councils make millions on landlord licences"                     
## [104] "No-deal Brexit ‘as unpredictable as financial crash’"                              
## [105] "First deep coal mine since 1980s approved"                                         
## [106] "Ocado upbeat despite depot fire setback"                                           
## [107] "Push to replace polluting gas boilers ‘risks backlash’"                            
## [108] "British film group Foundry sold to America’s Roper"                                
## [109] "Wait before Brexit stimulus, Bank is urged"                                        
## [110] "Asos brought down a peg as it fails to deliver"                                    
## [111] "Regulator ‘stunned’ by Musk’s tweeting"                                            
## [112] "Asda and Sainsbury’s pledge price cuts of 10 per cent to save merger"              
## [113] "Private equity has to watch its reputation as well as the bottom line"             
## [114] "Kingfisher chief executive departs"                                                
## [115] "Query for watchdog over failed savings firm"                                       
## [116] "Aberdeen wins £109bn dispute with Lloyds group"                                    
## [117] "Ignore Bramson’s siren call, Barclays tells investors"                             
## [118] "Octopus targets its tentacles on every home"                                       
## [119] "UBS fined £27m for millions of errors"                                             
## [120] "Cold water poured on Dunkerton’s Superdry bid"                                     
## [121] "New Kier boss back on top after Carillion near-miss"                               
## [122] "Wood Group’s debt reduction hit by slow oil recovery"                              
## [123] "Evraz bottom of league as Abramovich sells shares"                                 
## [124] "Accounts are delayed at Ferrexpo"                                                  
## [125] "Engineer can handle the pressure"                                                  
## [126] "Sainsbury’s boss ready to leave his demons behind"                                 
## [127] "Your three-minute digest"                                                          
## [128] "Everything you need to know about Isas in 2019"                                    
## [129] "Anne Ashworth: Hesitating? Don’t pass up a chance to save tax-free"                
## [130] "A helping hand on to the property ladder"                                          
## [131] "Look to emerging markets for some thrills"                                         
## [132] "Benefits of taking an ethical path"                                                
## [133] "Start early on the saving journey"                                                 
## [134] "A safe place for cash in times of turmoil"                                         
## [135] "Learn to converse in Isa — the terminology decoded"                                
## [136] "Five reasons to take out an Isa this tax year"                                     
## [137] "Worried about Corbyn? Here’s how to protect your money"                            
## [138] "The Isa strategy to follow if you want a comfortable retirement"                   
## [139] "Save £100 a month and watch your profit blossom"                                   
## [140] "A brief history of Isas in facts and figures"                                      
## [141] "A quick guide to stock-market speak"                                               
## [142] "How to find the perfect Isa fit"                                                   
## [143] "Funds to pick if you want growth"                                                  
## [144] "Funds to choose for income"                                                        
## [145] "The best of the online investors"                                                  
## [146] "Counties would be wrong to give Smith a helping hand"                              
## [147] "Smith wants county deal before Ashes"                                              
## [148] "‘I was not handled well – England would admit that’"                               
## [149] "Litmanen: Liverpool’s creative genius who made stars of those around him"          
## [150] "Forget statistics – Southgate has enviable choice"                                 
## [151] "‘It is good to be nice but I also needed a bit more bite’"                         
## [152] "Wembley trip a fine reward for Hughton, a manager defined by dignity and diligence"
## [153] "It’s not quite Moyes at United but Wales’ new coach Pivac has tough act to follow" 
## [154] "Barclays’ £10m investment is huge step forward for women’s football"               
## [155] "PFA: still too few Englishmen in Premier League top six"                           
## [156] "Foul Messi: Mourinho’s leaked plan for Chelsea"                                    
## [157] "Q&A: Everything you need to know about Euro 2020 qualification"                    
## [158] "Rashford an injury worry for Czech Republic game"                                  
## [159] "Six Nations Dissected: Farrell abandoned in his hour of need"                      
## [160] "Wales players could leave over pay, says Anscombe"                                 
## [161] "The Game Dissected: how Sigurdsson makes Everton tick"                             
## [162] "Cult heroes: the Newcastle winger who had flair, flamboyance – and was gorgeous"   
## [163] "Cult heroes: Cambridge striker who wore odd boots and punched his manager"         
## [164] "Anyone for tennis . . . at a football arena?"                                      
## [165] "Team Sky will race as Team Ineos at Tour de Yorkshire in May"                      
## [166] "Premiership clubs to propose relegation play-off as compromise"                    
## [167] "South Africa bowler eyes England cap"                                              
## [168] "Giggs fires back at Ibrahimovic over United criticism"                             
## [169] "BBC signs new four-year deal to show FA Cup"                                       
## [170] "Silva fined £12,000 but avoids touchline ban"                                      
## [171] "Derby owner trying to sell club for £1"                                            
## [172] "Bolton could lose 12 points if they go into administration"                        
## [173] "The day England were sent tumbling out by Maradona"                                
## [174] "Baseball player lands richest deal in sport at £325m"                              
## [175] "Japan Olympic chief quits amid corruption investigation"                           
## [176] "Times Sport Unseen: the best of our photographers’ pictures this week"             
## [177] "Champion tipster of the year Rob Wright’s racing tips"                             
## [178] "Commander William Hucklesby"                                                       
## [179] "Pamela Portman-Aitken"                                                             
## [180] "Major-General Geoffrey Field"                                                      
## [181] "Professor Edward Burn"                                                             
## [182] "March 19"                                                                          
## [183] "More death than births"                                                            
## [184] "Spring starts with a supermoon"                                                    
## [185] "Crossword Club"                                                                    
## [186] "Times Concise No 7917"                                                             
## [187] "Times Quick Cryptic No 1312"                                                       
## [188] "Times Cryptic No 27303"                                                            
## [189] "Concise Quintagram No 328"                                                         
## [190] "Cryptic Quintagram No 328"                                                         
## [191] "Sudoku No 10577 Super fiendish"                                                    
## [192] "Sudoku No 10576 Fiendish"                                                          
## [193] "Sudoku No 10575 Difficult"                                                         
## [194] "Killer Sudoku No 6493 Deadly"                                                      
## [195] "Killer Sudoku No 6492 Tricky"                                                      
## [196] "Brain Trainer No 2832"                                                             
## [197] "Cell Blocks No 3484"                                                               
## [198] "Codeword No 3601"                                                                  
## [199] "Futoshiki No 3393"                                                                 
## [200] "Kakuro No 2352"                                                                    
## [201] "KenKen No 4593"                                                                    
## [202] "Lexica No 4706"                                                                    
## [203] "Lexica No 4705"                                                                    
## [204] "Polygon"                                                                           
## [205] "Set Square No 2355"                                                                
## [206] "Suko No 2502"                                                                      
## [207] "Bridge"                                                                            
## [208] "Chess"                                                                             
## [209] "Clashtastic! The power of pattern"                                                 
## [210] "Anna Murphy: Who wants humdrum? Time to seek out the special"                      
## [211] "The new casual — why it’s time to wear a boilersuit"                               
## [212] "The best basket bags"                                                              
## [213] "Next time I’m asked how antisemitism started, I’ll say: go to this exhibition"     
## [214] "Mum made a porno. It was really good"                                              
## [215] "Carol Midgley: Even Pippa has been body-shamed. Full marks for effort, shamers"    
## [216] "The Times Daily Quiz"                                                              
## [217] "Brad and Leo"                                                                      
## [218] "The Bay at Nice at the Menier Chocolate Factory, SE1"                              
## [219] "Britten Sinfonia/Brad Mehldau at the Barbican"                                     
## [220] "Liza Pulman Sings Streisand at the Lyric, W1"                                      
## [221] "Libuse at the Bloomsbury Theatre, WC1"                                             
## [222] "OAE/Schiff at the Royal Festival Hall"                                             
## [223] "Seann Walsh at the Stables, Milton Keynes"                                         
## [224] "TV review: The Internet’s Dirtiest Secrets: The Cleaners; Shetland"                
## [225] "Lindsey Bareham’s fennel and carrot soup with lemon and bacon"                     
## [226] "What’s on TV tonight"                                                              
## [227] "Ben is Back"                                                                       
## [228] "Revealed: Edinburgh schools worst at hitting exam targets"                         
## [229] "Dozens fall ill due to problems with hospital buildings"                           
## [230] "Awesome buzz about peculiar monikers"                                              
## [231] "Unemployment lower than the UK but wages still lag behind"                         
## [232] "Army ‘failed woman murdered by obsessed soldier’"                                  
## [233] "Price of holidays to Europe halves before Brexit day"                              
## [234] "No Asian head teachers in Scotland"                                                
## [235] "Results expose the rich-poor divide"                                               
## [236] "Show The Evidence"                                                                 
## [237] "Hard lessons emerge from patchy data"                                              
## [238] "Chemicals leaking into Clyde an urgent risk, agency warns"                         
## [239] "Life in prison for wife and friends who murdered husband in his sleep"             
## [240] "Firefighters travelling 100 miles to provide emergency cover"                      
## [241] "Let Scotland set drugs laws so we can open safe injecting rooms, charities urge"   
## [242] "New law for injury payouts set to push up insurance costs"                         
## [243] "Hopes dashed in US hunt for Scottish fugitive"                                     
## [244] "Rail can revitalise Borders for fraction of London projects"                       
## [245] "Why not turn the Clyde green for St Patrick?"                                      
## [246] "Ofgem set to back Shetland electricity link"                                       
## [247] "North Sea’s £200bn bill to keep the oil flowing"                                   
## [248] "Pay for mosque security, Sarwar tells SNP"                                         
## [249] "Hypnotherapist is cleared of sex assaults"                                         
## [250] "Body found in grounds of primary school"                                           
## [251] "International Baccalaureate school to open this summer"                            
## [252] "Academic is targeted by racist group"                                              
## [253] "Panorama of Victorian Glasgow makes us question our future"                        
## [254] "Beavers win protection"                                                            
## [255] "Tierney in as Robertson loses fitness battle"                                      
## [256] "Morelos will be allowed to face Celtic"                                            
## [257] "‘I didn’t sleep after Conor McGregor sent me a message’"                           
## [258] "I’m not Carlton Palmer’s son, says new right back"                                 
## [259] "First-class approach is long overdue"                                              
## [260] "Kazakhstan investing in the future, warns Duff"                                    
## [261] "Barclay’s experience could be key for run-in"                                      
## [262] "Stop kicking Brexit down the road, EU warns May"                                   
## [263] "Sport Ireland seeks answers on Delaney cheque"                                     
## [264] "Futuristic way to regenerate the Phoenix"                                          
## [265] "I will follow: Bono’s son releases record"                                         
## [266] "Pair arrested after disco tragedy"                                                 
## [267] "Mourners pay tribute to victims of disco crush"                                    
## [268] "Schools open to support shocked pupils"                                            
## [269] "Sadness of a community missing three young stars"                                  
## [270] "FAI signs sponsorship with betting company"                                        
## [271] "‘Nothing could be done’ to save man drowned in storm"                              
## [272] "When the recession comes, it will be horrid"                                       
## [273] "Ardern shows what leadership looks like"                                           
## [274] "Setback for Supermac’s Europe bid"                                                 
## [275] "Household debt ratio at 2003 level but still among EU’s highest"                   
## [276] "Politics holds key to property taxes"                                              
## [277] "Applegreen sees no holes in Brexit road"                                           
## [278] "Citigroup scales up operations in Dublin"                                          
## [279] "Avolon flying high to reduce debt ratio"                                           
## [280] "Hammer time offers hope to Mincon"                                                 
## [281] "Irish job hunters face EU’s second lowest vacancy rate"                            
## [282] "Maybe the Mob can stop May’s deal getting whacked"                                 
## [283] "Mumps cases this year higher than whole of 2018"                                   
## [284] "Patients ‘paying the price’ for shortage of consultants"                           
## [285] "Pine marten may be the saviour of red squirrel"                                    
## [286] "Minister rejects ticket website’s legal warning over touting ban"                  
## [287] "Hellish account of madness and recovery sets literary world alight"                
## [288] "Arts and heritage ‘ruined by rising insurance costs’"                              
## [289] "New appeal for wife who disappeared two years ago"                                 
## [290] "Leave governing to robots, a quarter of Europeans say"                             
## [291] "Closing ward in crowded hospital ‘beggars belief’"                                 
## [292] "Slippery Slope"                                                                    
## [293] "Farmers must be encouraged to let their hedges grow"                               
## [294] "Maguire given the nod to lead Irish attack"                                        
## [295] "McCarthy’s mission: win games and put bums on seats"                               
## [296] "Keogh toughing it out in bid to be part of new era"                                
## [297] "McFadden: No better man than Schmidt to right the wrongs"                          
## [298] "Henderson set to return for showdown with Leinster"                                
## [299] "Don’t write off Ireland, says Marshall"                                            
## [300] "Friend praises international trio for impact"                                      
## [301] "Reaching top tier of league is only half the battle"                               
## [302] "Gutsy Barnes ready to hang up gloves after career to remember"                     
## [303] "Dangerous to write off coaching greats prematurely"

APIs

  • API: application programming interface

  • Structured way for programs to talk to each other

  • Lots of organisations provide them: private sector and public sector

Gagging clauses 1

APIs

https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?stages=award&order=ASC&page=1

Open contracting API data
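
A rough sketch of working with an API like this from R, using the jsonlite package. The URL and its `stages`, `order` and `page` parameters come from the slide above. The response body is hypothetical and only shows the general shape of the approach, so the real field names will differ.

```r
library(jsonlite)

# Build the search URL for a given page of results
search_url <- function(page) {
  paste0(
    "https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search",
    "?stages=award&order=ASC&page=", page
  )
}

# Hypothetical response body (assumption: not the real API schema)
example_response <- '{"results": [
  {"id": "a1", "buyer": "Department X", "value": 120000},
  {"id": "b2", "buyer": "Council Y", "value": 45000}
]}'

# fromJSON() accepts a URL, a file path or a JSON string, so against the
# live API this would be fromJSON(search_url(1))
parsed <- fromJSON(example_response)

# An array of similar objects is simplified into a data frame
awards <- parsed$results
total_value <- sum(awards$value)
```

Looping `search_url()` over page numbers and binding the resulting tables together is the API equivalent of paging through an index when scraping.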

Electric cars

APIs

  • Not always documented

  • Dynamic websites use APIs behind the scenes

  • How can we access these?

Zap-Map 1

Zap-Map 2

Zap-Map 3

Zap-Map 4

Amalgamation

Joins

  • Derived from relational algebra; popularised by SQL databases

  • Think of Excel’s VLOOKUP function

  • Basic but misunderstood…

Scary joins diagram

Joins

Inner joins

  • Find matching rows between two tables based on a column

  • Create a new table with just the matching rows

Outer joins

  • Find matching rows between two tables based on a column

  • Keep every row of table A and stick the matches from table B on to its right-hand side (or vice versa); rows with no match get blanks
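
A small sketch of the difference using dplyr, with made-up tables. `left_join()` is one kind of outer join: it keeps every row of the first table and fills in `NA` where the second table has no match.

```r
library(dplyr)

# Hypothetical tables sharing a 'company' column
companies <- data.frame(
  company = c("Acme Ltd", "Bolt plc", "Cedar LLP"),
  sector = c("Energy", "Retail", "Legal")
)
contracts <- data.frame(
  company = c("Acme Ltd", "Acme Ltd", "Dune Ltd"),
  value = c(100, 250, 75)
)

# Inner join: only rows with a match in both tables
matched <- inner_join(companies, contracts, by = "company")

# Left (outer) join: every company, with contract details where they exist
all_companies <- left_join(companies, contracts, by = "company")
```

Here `matched` has two rows (both for Acme Ltd), while `all_companies` has four: the two Acme rows plus Bolt plc and Cedar LLP with `NA` in `value`. Dune Ltd appears in neither, because it is not in the first table.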

Nice joins diagram

Fuzzy matching

  • Data entered by humans is messy

  • Louis Goddard / Louis Godard / L Goddard / Goddard, Louis / Mr Louis Goddard

  • Needs to be standardised before it can be joined

Tax 1

Land map

Fuzzy matching

##     country.x   country.y year tax_exiles
## 1      Monaco      Monaco 2013         56
## 2      Monaco      monaco 2014         23
## 3      Monaco     Monnaco 2015         35
## 4 Switzerland Switzerland 2013        245
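
The code that produced the output above is not shown. One way to get a similar result, sketched here in base R as an assumption rather than the original method, is to measure the edit distance between each messy name and a list of clean names with `adist()`. The rows reuse the values from the output above, and the `max_dist` threshold of 2 is an arbitrary choice.

```r
# Clean reference names and a messy table to be standardised
clean <- c("Monaco", "Switzerland")
messy <- data.frame(
  country = c("Monaco", "monaco", "Monnaco", "Switzerland"),
  year = c(2013, 2014, 2015, 2013),
  tax_exiles = c(56, 23, 35, 245)
)

# Edit distance between every messy name (rows) and clean name (columns),
# ignoring differences in case
distances <- adist(messy$country, clean, ignore.case = TRUE)

# For each messy name, find the closest clean name...
closest <- unname(apply(distances, 1, which.min))

# ...and accept it only if it is within max_dist edits (arbitrary threshold)
max_dist <- 2
within_limit <- distances[cbind(seq_along(closest), closest)] <= max_dist
messy$country_clean <- ifelse(within_limit, clean[closest], NA)
```

Once `country_clean` is filled in, the table can be joined to other data on that column in the usual way. Matches near the threshold are worth checking by hand.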

Analysis

Statistical analysis

  • Difficult to write stories based on statistics: they are hard to grasp intuitively

  • Issues of trust. What constitutes an outlier?

  • Visualisation can help, but only so much

Ski doping

Statistical analysis

  • Database of blood test results leaked to the Insight team and ARD

  • Score generated based on consultation with two expert sources

  • Suspicious results for medal-winners sent to experts for confirmation

Geospatial analysis

  • Huge opportunities: not many data journalists do it well!

  • Not all about mapping. Useful for amalgamation of data sources, e.g. how many Xs are there in Y area?

  • Benefits from using reproducible workflows rather than graphical GIS software
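
The "how many Xs are there in Y area?" question can be answered in a few lines with the sf package, which is one common choice for reproducible geospatial work in R. This is a toy sketch: the square "licence area" and the four points are invented, and no coordinate reference system is set. Real shapefiles would be loaded with `st_read()`.

```r
library(sf)

# A hypothetical square area with corners at (0, 0) and (4, 4)
square <- st_polygon(list(rbind(
  c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0)
)))
areas <- st_sf(name = "Area A", geometry = st_sfc(square))

# Hypothetical point locations, e.g. postcodes of land titles
points <- st_as_sf(
  data.frame(id = 1:4, x = c(1, 2, 5, 3), y = c(1, 3, 5, 6)),
  coords = c("x", "y")
)

# For each area, list the points that fall inside it, then count them
hits <- st_intersects(areas, points)
areas$n_points <- lengths(hits)
```

Two of the four points fall inside the square, so `areas$n_points` is 2. Because the whole thing is a script, it can be rerun unchanged when either dataset is updated.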

Church land

Geospatial analysis

  • Shapefiles of licence areas from Oil and Gas Authority

  • Data on commercial and corporate land ownership from Land Registry

  • Postcode location data from Ordnance Survey

Tools

What is R?

  • Developed as a statistical language in the ’90s, based on S

  • A ‘scripting language’: used for quickly arranging workflows rather than building software

  • Very modular and extensible – can be adapted for different purposes

R vs. Python

  • Data journalists and academics vs. developers

  • Tidyverse vs. Pandas

  • Vectors and pipes vs. loops (and other stylistic differences)

  • RMarkdown vs. Jupyter
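
To illustrate the "vectors and pipes vs. loops" point, here are the same kinds of calculation written in different styles, all in R for comparison. The prices and the 20% uplift are made-up values. The base pipe `|>` needs R 4.1 or later; Tidyverse code often uses `%>%` instead.

```r
prices <- c(120, 250, 75, 310)

# Loop style: visit each element in turn
with_uplift_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_uplift_loop[i] <- prices[i] * 1.2
}

# Vectorised style: the multiplication applies to every element at once
with_uplift <- prices * 1.2

# Pipe style: read left to right as "take prices, sort them,
# keep the three lowest, then average them"
average_of_lowest <- prices |> sort() |> head(3) |> mean()
```

The loop and the vectorised line give the same result. The vectorised version is simply the more idiomatic way to write it in R.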

The R community

  • Lots of non-programmers = lots of support available online

  • Stack Overflow – Q&A website

  • #rstats hashtag on Twitter

  • RStudio Community forum

RStudio IDE

RStudio

  • Integrated development environment (IDE)

  • Basically a text editor with bells and whistles

  • Not essential for working with R, but makes it much easier!

RStudio

  • Also the name of the company that makes RStudio

  • Employs lots of R package developers and maintainers, particularly of Tidyverse packages

  • A force for good in the R ecosystem! For now, anyway…

Hadley

Tidyverse

  • A collection of packages that work together to make data science (and data journalism) easier

  • Key packages: readr (reading and writing data), dplyr (transforming data), tidyr (tidying data), ggplot2 (visualisation)
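
A short sketch of how two of these packages fit together, using an invented spending table. tidyr reshapes a "wide" table (one column per year) into a tidy one (one row per observation), and dplyr then summarises it. In a full workflow readr would typically load the data first and ggplot2 would chart the result.

```r
library(dplyr)
library(tidyr)

# Hypothetical 'wide' table: one column per year
wide <- data.frame(
  council = c("A", "B"),
  `2017` = c(10, 20),
  `2018` = c(15, 25),
  check.names = FALSE
)

# Tidy it: one row per council per year
tidy <- wide |>
  pivot_longer(-council, names_to = "year", values_to = "spend") |>
  mutate(year = as.integer(year))

# Transform it: total spend per council
totals <- tidy |>
  group_by(council) |>
  summarise(total = sum(spend))
```

`tidy` has four rows (two councils by two years), and `totals` has one row per council.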

R4DS

r4ds.had.co.nz

RMarkdown

  • ‘Literate programming’ – code and explanation flow seamlessly together

  • Plain-text, human-readable format – an advantage over Jupyter notebooks

  • Output to many different formats: PDF reports, web pages, even slides!

Data sources

Companies House

  • Free company data product

    • Simple CSV with details on every active company

    • Name, address, SIC codes, accounts data, etc.

  • Persons with significant control (PSC) snapshot

    • Big NDJSON file showing people who control UK companies

    • Needs a bit of processing, but extremely useful
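
NDJSON just means one JSON object per line, which jsonlite's `stream_in()` can read into a data frame. The two records below are invented and their field names are assumptions, not the real PSC schema, which is more deeply nested. For the actual snapshot you would open the downloaded file with `file()` rather than `textConnection()`.

```r
library(jsonlite)

# Two hypothetical records, one JSON object per line
ndjson <- c(
  '{"company_number": "00000001", "name": "A Person", "kind": "individual"}',
  '{"company_number": "00000002", "name": "B Holdings Ltd", "kind": "corporate"}'
)

# Read the line-delimited JSON into a data frame
con <- textConnection(ndjson)
psc <- stream_in(con, verbose = FALSE)
close(con)

# From here it is an ordinary table, e.g. keep only individuals
individuals <- psc[psc$kind == "individual", ]
```

For a file too big to fit in memory, `stream_in()` also takes a `handler` function that processes the data one batch of lines at a time.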

Land Registry

  • Commercial and corporate ownership data (CCOD)

    • Land in England and Wales owned by companies and corporate entities (e.g. government, the Church)

    • Doesn’t include reliable geolocation information 😭 (not all addresses have postcodes)

  • Overseas companies ownership data (OCOD)

    • The same but for overseas companies, generally registered in tax havens like the BVI

Other stuff

  • Cabinet Office open contracting data

    • API giving details of all contract tenders and awards

    • Tussell sells this for £700/month – and it’s all free!

  • Energy Performance Certificate data

  • Food hygiene rating data

  • Think laterally!

louisg.xyz/birkbeck